157 research outputs found
Hierarchical testing designs for pattern recognition
We explore the theoretical foundations of a "twenty questions" approach to
pattern recognition. The object of the analysis is the computational process
itself rather than probability distributions (Bayesian inference) or decision
boundaries (statistical learning). Our formulation is motivated by applications
to scene interpretation in which there are a great many possible explanations
for the data, one ("background") is statistically dominant, and it is
imperative to restrict intensive computation to genuinely ambiguous regions.
The focus here is then on pattern filtering: Given a large set $\mathcal{Y}$ of
possible patterns or explanations, narrow down the true one $Y$ to a small
(random) subset $\hat Y\subset\mathcal{Y}$ of "detected" patterns to be
subjected to further, more intense, processing. To this end, we consider a
family of hypothesis tests for $Y\in A$ versus the nonspecific alternatives
$Y\in A^c$. Each test has null type I error and the candidate sets
$A\subset\mathcal{Y}$ are arranged in a hierarchy of nested partitions. These
tests are then characterized by scope ($|A|$), power (or type II error) and
algorithmic cost. We consider sequential testing strategies in
which decisions are made iteratively, based on past outcomes, about which test
to perform next and when to stop testing. The set $\hat Y$ is then taken to be
the set of patterns that have not been ruled out by the tests performed. The
total cost of a strategy is the sum of the "testing cost" and the
"postprocessing cost" (proportional to $|\hat Y|$) and the corresponding
optimization problem is analyzed.
Comment: Published at http://dx.doi.org/10.1214/009053605000000174 in the
Annals of Statistics (http://www.imstat.org/aos/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
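The filtering setup in this abstract can be illustrated with a minimal Python sketch. It is not the paper's analysis: the names `Node`, `build` and `coarse_to_fine`, the binary hierarchy, the unit test costs and the depth-first testing order are all assumptions made for illustration. The only properties taken from the abstract are that candidate sets are nested, that a test never rejects a set containing the true pattern, and that total cost is testing cost plus a postprocessing cost proportional to the number of detected patterns.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class Node:
    """One candidate set A in the nested hierarchy (hypothetical structure)."""
    patterns: frozenset                      # the set A
    cost: float                              # assumed cost of running the test for A
    children: List["Node"] = field(default_factory=list)


def build(lo: int, hi: int) -> Node:
    """Binary hierarchy of nested partitions over the patterns lo..hi-1."""
    node = Node(frozenset(range(lo, hi)), cost=1.0)
    if hi - lo > 1:
        mid = (lo + hi) // 2
        node.children = [build(lo, mid), build(mid, hi)]
    return node


def coarse_to_fine(root: Node,
                   test: Callable[[Node], bool],
                   post_cost_per_pattern: float) -> Tuple[Set[int], float]:
    """Test cells from coarse to fine; return (detected set, total cost)."""
    detected: Set[int] = set()
    testing_cost = 0.0
    stack = [root]
    while stack:
        node = stack.pop()
        testing_cost += node.cost
        if not test(node):
            continue                         # negative test: all of A is ruled out
        if node.children:
            stack.extend(node.children)      # positive: refine to the finer cells
        else:
            detected |= node.patterns        # positive leaf: pattern stays detected
    total = testing_cost + post_cost_per_pattern * len(detected)
    return detected, total
```

With eight patterns, true pattern 5 and an idealized test that fires only on cells containing it, `coarse_to_fine(build(0, 8), lambda n: 5 in n.patterns, 2.0)` runs seven tests and keeps one pattern. Tests with false positives would leave a larger detected set and a higher postprocessing cost.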
Rank discriminants for predicting phenotypes from RNA expression
Statistical methods for analyzing large-scale biomolecular data are
commonplace in computational biology. A notable example is phenotype prediction
from gene expression data, for instance, detecting human cancers,
differentiating subtypes and predicting clinical outcomes. Still, clinical
applications remain scarce. One reason is that the complexity of the decision
rules that emerge from standard statistical learning impedes biological
understanding, in particular, any mechanistic interpretation. Here we explore
decision rules for binary classification utilizing only the ordering of
expression among several genes; the basic building blocks are then two-gene
expression comparisons. The simplest example, just one comparison, is the TSP
classifier, which has appeared in a variety of cancer-related discovery
studies. Decision rules based on multiple comparisons can better accommodate
class heterogeneity, and thereby increase accuracy, and might provide a link
with biological mechanism. We consider a general framework ("rank-in-context")
for designing discriminant functions, including a data-driven selection of the
number and identity of the genes in the support ("context"). We then specialize
to two examples: voting among several pairs and comparing the median expression
in two groups of genes. Comprehensive experiments assess accuracy relative to
other, more complex, methods, and reinforce earlier observations that simple
classifiers are competitive.
Comment: Published at http://dx.doi.org/10.1214/14-AOAS738 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
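The two rank-based rules named in this abstract, voting among several gene pairs and comparing median expression in two gene groups, can be sketched in a few lines of Python. The function names, the class coding (0/1) and the direction of each comparison are assumptions for illustration; the selection of genes and pairs from data is not shown.

```python
import numpy as np


def vote_among_pairs(x, pairs):
    """Predict class 1 if most comparisons x[i] < x[j] hold, else class 0.

    `pairs` is a list of (i, j) gene indices, assumed to be chosen beforehand.
    An odd number of pairs avoids ties.
    """
    votes = sum(1 for i, j in pairs if x[i] < x[j])
    return int(votes > len(pairs) / 2)


def median_comparison(x, group_a, group_b):
    """Predict class 1 if the median expression in group_a is below group_b's."""
    return int(np.median(x[group_a]) < np.median(x[group_b]))


# Toy expression profile for six genes (made-up numbers).
x = np.array([5.0, 1.0, 3.0, 9.0, 2.0, 7.0])
pairs = [(1, 0), (2, 3), (5, 4)]
print(vote_among_pairs(x, pairs))
print(median_comparison(x, [1, 2, 4], [0, 3, 5]))
```

Both rules depend only on the ordering of expression values within a sample, so any strictly increasing transformation of `x` (for example a log transform) leaves the predictions unchanged.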
A Statistical Framework for Image Category Search from a Mental Picture
Keywords: Image Retrieval; Relevance Feedback; Page Zero Problem; Mental Matching; Bayesian System; Statistical Learning
Starting from a member of an image database designated the “query image,” traditional image retrieval techniques, for example search by visual similarity, allow one to locate additional instances of a target category residing in the database. However, in many cases, the query image or, more generally, the target category, resides only in the mind of the user as a set of subjective visual patterns, psychological impressions or “mental pictures.” Consequently, since image databases available today are often unstructured and lack reliable semantic annotations, it is often not obvious how to initiate a search session; this is the “page zero problem.” We propose a new statistical framework based on relevance feedback to locate an instance of a semantic category in an unstructured image database with no semantic annotations. A search session is initiated from a random sample of images. At each retrieval round the user is asked to select one image from among a set of displayed images: the one that is closest, in the user's opinion, to the target class. The matching is then “mental.” Performance is measured by the number of iterations necessary to display an image which satisfies the user, at which point standard techniques can be employed to display other instances. Our core contribution is a Bayesian formulation which scales to large databases. The two key components are a response model which accounts for the user's subjective perception of similarity and a display algorithm which seeks to maximize the flow of information. Experiments with real users and two databases of 20,000 and 60,000 images demonstrate the efficiency of the search process.
Identification of direction in gene networks from expression and methylation
Background: Reverse-engineering gene regulatory networks from expression data is difficult, especially without temporal measurements or interventional experiments. In particular, the causal direction of an edge is generally not statistically identifiable, i.e., cannot be inferred as a statistical parameter, even from an unlimited amount of non-time series observational mRNA expression data. Some additional evidence is required, and high-throughput methylation data can be viewed as a natural multifactorial gene perturbation experiment.
Results: We introduce IDEM (Identifying Direction from Expression and Methylation), a method for identifying the causal direction of edges by combining DNA methylation and mRNA transcription data. We describe the circumstances under which edge directions become identifiable. Experiments with both real and synthetic data demonstrate that the accuracy of IDEM for inferring both edge placement and edge direction in gene regulatory networks is significantly improved relative to other methods.
Conclusion: Reverse-engineering directed gene regulatory networks from static observational data becomes feasible by exploiting the context provided by high-throughput DNA methylation data. An implementation of the algorithm described is available at http://code.google.com/p/idem/
Interpretable by Design: Learning Predictors by Composing Interpretable Queries
There is a growing concern about typically opaque decision-making with
high-performance machine learning algorithms. Providing an explanation of the
reasoning process in domain-specific terms can be crucial for adoption in
risk-sensitive domains such as healthcare. We argue that machine learning
algorithms should be interpretable by design and that the language in which
these interpretations are expressed should be domain- and task-dependent.
Consequently, we base our model's prediction on a family of user-defined and
task-specific binary functions of the data, each having a clear interpretation
to the end-user. We then minimize the expected number of queries needed for
accurate prediction on any given input. As the solution is generally
intractable, following prior work, we choose the queries sequentially based on
information gain. However, in contrast to previous work, we need not assume the
queries are conditionally independent. Instead, we leverage a stochastic
generative model (VAE) and an MCMC algorithm (Unadjusted Langevin) to select
the most informative query about the input based on previous query-answers.
This enables the online determination of a query chain of whatever depth is
required to resolve prediction ambiguities. Finally, experiments on vision and
NLP tasks demonstrate the efficacy of our approach and its superiority over
post-hoc explanations.
Comment: 29 pages, 14 figures. Accepted as a Regular Paper in Transactions on
Pattern Analysis and Machine Intelligence.
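The greedy, information-gain-based choice of queries described in this abstract can be sketched on a toy problem. This is a deliberately simplified stand-in: instead of the VAE and Unadjusted Langevin sampler mentioned above, it conditions on the history by filtering a small explicit table of binary query answers and labels. The names `entropy`, `next_query` and `information_pursuit`, and the toy table, are assumptions for illustration.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (bits) of the empirical label distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def next_query(answers, labels, asked):
    """Return the unasked binary query with the largest information gain."""
    base = entropy(labels)
    best_q, best_gain = None, -1.0
    for q in range(answers.shape[1]):
        if q in asked:
            continue
        cond = 0.0                           # expected label entropy after asking q
        for a in (0, 1):
            mask = answers[:, q] == a
            if mask.any():
                cond += mask.mean() * entropy(labels[mask])
        if base - cond > best_gain:
            best_q, best_gain = q, base - cond
    return best_q, best_gain


def information_pursuit(answers, labels, x, max_entropy=0.0):
    """Ask queries about x until the label given the history is (nearly) certain.

    Assumes x matches at least one row of `answers`, so the set of rows
    consistent with the history never becomes empty.
    """
    keep = np.ones(len(labels), dtype=bool)  # rows consistent with the history
    history = []
    while entropy(labels[keep]) > max_entropy and len(history) < answers.shape[1]:
        q, _ = next_query(answers[keep], labels[keep], {h[0] for h in history})
        history.append((q, int(x[q])))
        keep &= answers[:, q] == x[q]
    values, counts = np.unique(labels[keep], return_counts=True)
    return history, int(values[np.argmax(counts)])


# Toy table: 4 samples, 3 binary queries; query 0 alone determines the label.
answers = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [1, 1, 1]])
labels = np.array([0, 0, 1, 1])
print(information_pursuit(answers, labels, np.array([1, 0, 0])))
```

The length of the returned history varies with the input, which mirrors the abstract's point that the query chain is as deep as needed to resolve ambiguity and no deeper.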
Two-transcript gene expression classifiers in the diagnosis and prognosis of human diseases.
RIGHTS: This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief, you may: copy, distribute, and display the work; make derivative works; or make commercial use of the work, under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are.
BACKGROUND: Identification of molecular classifiers from genome-wide gene expression analysis is an important practice for the investigation of biological systems in the post-genomic era, and one with great potential for near-term clinical impact. The 'Top-Scoring Pair' (TSP) classification method identifies pairs of genes whose relative expression correlates strongly with phenotype. In this study, we sought to assess the effectiveness of the TSP approach in the identification of diagnostic classifiers for a number of human diseases including bacterial and viral infection, cardiomyopathy, diabetes, Crohn's disease, and transformed ulcerative colitis. We examined transcriptional profiles from both solid tissues and blood-borne leukocytes.
RESULTS: The algorithm identified multiple predictive gene pairs for each phenotype, with cross-validation accuracy ranging from 70 to nearly 100 percent, and high sensitivity and specificity observed in most classification tasks. Performance compared favourably with that of pre-existing transcription-based classifiers, and in some cases was comparable to the accuracy of current clinical diagnostic procedures. Several diseases of solid tissues could be reliably diagnosed through classifiers based on the blood-borne leukocyte transcriptome. The TSP classifier thus represents a simple yet robust method to differentiate between diverse phenotypic states based on gene expression profiles.
CONCLUSION: Two-transcript classifiers have the potential to reliably classify diverse human diseases, through analysis of both local diseased tissue and the immunological response assayed through blood-borne leukocytes. The experimental simplicity of this method results in measurements that can be easily translated to clinical practice.
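How a pair of genes "whose relative expression correlates strongly with phenotype" might be found can be sketched as follows. The score used here, the absolute difference between the two classes in the frequency of the event "gene i is expressed below gene j", is the usual TSP score as I understand it; the function name `top_scoring_pair`, the exhaustive search over all pairs and the toy matrix are assumptions for illustration, not the study's pipeline.

```python
from itertools import combinations

import numpy as np


def top_scoring_pair(X, y):
    """Return (i, j, score) for the gene pair with the largest score.

    X: array of shape (samples, genes); y: array of 0/1 class labels.
    score(i, j) = |P(X_i < X_j | class 0) - P(X_i < X_j | class 1)|.
    """
    X0, X1 = X[y == 0], X[y == 1]
    best = (None, None, -1.0)
    for i, j in combinations(range(X.shape[1]), 2):
        p0 = np.mean(X0[:, i] < X0[:, j])    # frequency of the ordering in class 0
        p1 = np.mean(X1[:, i] < X1[:, j])    # frequency of the ordering in class 1
        score = abs(p0 - p1)
        if score > best[2]:
            best = (i, j, score)
    return best


# Toy data (made up): genes 0 and 1 flip order between the classes,
# gene 2 is always the highest and carries no information.
X = np.array([[1, 2, 9], [2, 3, 9], [3, 1, 9], [4, 2, 9]])
y = np.array([0, 0, 1, 1])
print(top_scoring_pair(X, y))
```

A new sample is then classified by checking which ordering of the two selected transcripts it shows, which is why only two measurements are needed at prediction time.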
Variational Information Pursuit for Interpretable Predictions
There is a growing interest in the machine learning community in developing
predictive algorithms that are "interpretable by design". Towards this end,
recent work proposes to make interpretable decisions by sequentially asking
interpretable queries about data until a prediction can be made with high
confidence based on the answers obtained (the history). To promote short
query-answer chains, a greedy procedure called Information Pursuit (IP) is
used, which adaptively chooses queries in order of information gain. Generative
models are employed to learn the distribution of query-answers and labels,
which is in turn used to estimate the most informative query. However, learning
and inference with a full generative model of the data is often intractable for
complex tasks. In this work, we propose Variational Information Pursuit (V-IP),
a variational characterization of IP which bypasses the need for learning
generative models. V-IP is based on finding a query selection strategy and a
classifier that minimizes the expected cross-entropy between true and predicted
labels. We then demonstrate that the IP strategy is the optimal solution to
this problem. Therefore, instead of learning generative models, we can use our
optimal strategy to directly pick the most informative query given any history.
We then develop a practical algorithm by defining a finite-dimensional
parameterization of our strategy and classifier using deep networks and train
them end-to-end using our objective. Empirically, V-IP is 10-100x faster than
IP on different Vision and NLP tasks with competitive performance. Moreover,
V-IP finds much shorter query chains when compared to reinforcement learning
which is typically used in sequential-decision-making problems. Finally, we
demonstrate the utility of V-IP on challenging tasks like medical diagnosis
where the performance is far superior to the generative modelling approach.
Comment: Code is available at
https://github.com/ryanchankh/VariationalInformationPursui
Identifying Personalized Metabolic Signatures in Breast Cancer.
Cancer cells are adept at reprogramming energy metabolism, and the precise manifestation of this metabolic reprogramming exhibits heterogeneity across individuals (and from cell to cell). In this study, we analyzed the metabolic differences between interpersonal heterogeneous cancer phenotypes. We used divergence analysis on gene expression data of 1156 breast normal and tumor samples from The Cancer Genome Atlas (TCGA) and integrated this information with a genome-scale reconstruction of human metabolism to generate personalized, context-specific metabolic networks. Using this approach, we classified the samples into four distinct groups based on their metabolic profiles. Enrichment analysis of the subsystems indicated that amino acid metabolism, fatty acid oxidation, citric acid cycle, androgen and estrogen metabolism, and reactive oxygen species (ROS) detoxification distinguished these four groups. Additionally, we developed a workflow to identify potential drugs that can selectively target genes associated with the reactions of interest. MG-132 (a proteasome inhibitor) and OSU-03012 (a celecoxib derivative) were the top-ranking drugs identified from our analysis and are known to have anti-tumor activity. Our approach has the potential to provide mechanistic insights into cancer-specific metabolic dependencies, ultimately enabling the identification of potential drug targets for each patient independently, contributing to a rational personalized medicine approach.